Question: What might affect the chance of getting heart disease?
- We have data from the Western Collaborative Group Study, which includes 3154 healthy men, aged 39 to 59, from the San Francisco area.
- At the start of the study, all were free of heart disease.
- Eight and a half years later, the study recorded whether these men now suffered from coronary heart disease (chd).
- Other recorded variables that might be related to the chance of developing this disease are:
  - age - age in years
  - height - height in inches
  - weight - weight in pounds
  - sdp - systolic blood pressure in mm Hg
  - dbp - diastolic blood pressure in mm Hg
  - chol - fasting serum cholesterol in mm %
  - behave - behaviour type (A1, A2, B3, B4)
  - cigs - number of cigarettes smoked per day
  - dibep - behaviour type (A = Aggressive, B = Passive)
  - typechd - type of coronary heart disease (angina, infdeath, none, or silent)
  - timechd - time of coronary heart disease or end of follow-up
  - arcus - arcus senilis (absent or present)
Load the data
- Is this data experimental or observational?
Explore the data (initial data analysis)
- There are a number of ways to explore data.
- First let’s start by looking at the distribution of the outcome (coronary heart disease). What do you notice?
- Next let’s look at the marginal distributions of the numerical covariates by the outcome. Why is the median value of timechd clustered around 3000 for those who didn’t have coronary heart disease?
Answer
The study ran for 8.5 years, which is about 8.5 \(\times\) 365.25 \(\approx\) 3105 days. For the group that didn’t have coronary heart disease, timechd records the number of days to the end of follow-up, so most values sit close to the end of the study. A much earlier time within this group (e.g. 1000 days) means that the participant dropped out of the study for some reason (e.g. death, or the participant could not be contacted).
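The follow-up length in days can be checked with a short calculation. This is a minimal Python sketch of the arithmetic only; the constant names are illustrative and are not variables in the study data.

```python
# Length of follow-up stated in the study description, in years.
FOLLOW_UP_YEARS = 8.5
# Average number of days per year, allowing for leap years.
DAYS_PER_YEAR = 365.25

# Approximate follow-up length in days.
follow_up_days = FOLLOW_UP_YEARS * DAYS_PER_YEAR
print(round(follow_up_days))  # 3105
```

A median timechd near 3000 for the group without coronary heart disease is therefore consistent with most of these men being followed for nearly the whole study.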
- Let’s also look at the marginal distributions of the categorical variables by the outcome. Are some levels within a factor more strongly associated with chd? What do you notice in particular about typechd?
Answer
typechd is derived from chd, so those who did not have coronary heart disease are all assigned “none”, as expected.
ggpairs() from the GGally package is useful for looking at the pairwise relationship between any two covariates in the data. It will be slow to compute if you have many variables (and the individual graphs may be too small to see), so you may need to subset to a smaller number of variables first.
- Let’s zoom in and have a look at the relationship between sdp and dbp. What do you notice about the relationship between these variables?
Answer
The variables sdp and dbp are highly correlated. Systolic pressure (sdp) is the maximum blood pressure during contraction of the ventricles, while diastolic pressure (dbp) is the minimum pressure recorded just prior to the next contraction. Both measure blood pressure (in different ways), so the high correlation is perhaps expected!
- What is the relationship between behave and dibep?
Answer
All those assigned the value “A” for dibep are either “A1” or “A2” for behave. Similarly, all those assigned the value “B” for dibep are either “B3” or “B4” for behave. This suggests that behave is perhaps a further refinement of the behaviour type given by dibep (so behave is nested within dibep).
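Nesting of one factor within another can be checked by confirming that every level of the finer factor appears with exactly one level of the coarser factor. The following is a minimal Python sketch; the helper `is_nested` is hypothetical, and the records are a handful of made-up values for illustration, not the study data.

```python
from collections import defaultdict


def is_nested(pairs):
    """Return True if every fine level occurs with exactly one coarse level."""
    parents = defaultdict(set)
    for fine, coarse in pairs:
        parents[fine].add(coarse)
    return all(len(coarse_levels) == 1 for coarse_levels in parents.values())


# Made-up (behave, dibep) records for illustration only; not the WCGS data.
toy_records = [("A1", "A"), ("A2", "A"), ("B3", "B"), ("B4", "B"), ("A1", "A")]
print(is_nested(toy_records))
```

A cross-tabulation of the two factors shows the same thing visually: each row for the finer factor has non-zero counts in only one column of the coarser factor.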
Model the data
- Why would you not (or would you) use typechd and timechd as predictors in the model?
Answer
The variables typechd and timechd are calculated based on the outcome! You can’t use predictors that were derived based on the outcome.
- Do you think the behaviour type variable (behave) should be included in the model?
Answer
behave does not appear to contribute (statistically) significantly to explaining chd, so we omit it from the model.
- What do you think is the best model to explain chd?
Answer
Variable selection is a hard problem! We can use stepwise selection, which alternates between backward selection (drop the least significant variable) and forward selection (add the most significant variable) until it meets a certain criterion, but an “automated” selection like this doesn’t account for the domain context.
For example, sdp is in the final model selected by stepwise selection, but sdp is highly correlated with dbp. Should dbp have been included instead of sdp? Also, weight is included but not height. Tall people would naturally weigh more than short people with the same body type. Would it have been better to engineer another feature that normalises weight with respect to height? Body mass index (weight in kg divided by the square of body height in metres) is supposed to account for weight with respect to height.
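Since weight is recorded in pounds and height in inches, both need converting before applying the body mass index formula. The following is a minimal Python sketch of that conversion; the function `bmi` and the example values are illustrative assumptions, not part of the original analysis.

```python
LB_TO_KG = 0.45359237  # kilograms per pound (exact by definition)
IN_TO_M = 0.0254       # metres per inch (exact by definition)


def bmi(weight_lb, height_in):
    """Body mass index in kg/m^2 from weight in pounds and height in inches."""
    weight_kg = weight_lb * LB_TO_KG
    height_m = height_in * IN_TO_M
    return weight_kg / height_m ** 2


# Illustrative values only: a 150 lb person who is 68 inches tall.
print(round(bmi(150, 68), 1))  # 22.8
```

A derived variable like this could be offered to the model in place of weight, although whether it improves the fit would need to be checked on the data.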